Distributed Hierarchical Scheduling and Arbitration for Bandwidth Allocation

The continual growth of demand for manageable bandwidth in networks requires the development of new
techniques in switch design which decouple the complexity of control from the scale of the port count and
aggregate bandwidth. This paper describes a switch architecture and a set of methods which provide the
means by which switches of arbitrary size can be constructed while maintaining the ability to allocate
guaranteed bandwidth to each possible connection through the switch.

A digital switch is used to route data streams from a set of source components to a set of destination 
components. A cell based switch operates on data which is packetised into streams of equal size cells. In a 
large switch, the routing functions can be implemented hierarchically: sets of lower bandwidth ports are 
aggregated into a smaller number of higher bandwidth ports which are then interconnected in a central 
switch. This document also describes how the bandwidth allocation technique can be applied in such a 
hierarchical switch. 



A scheduling and arbitration process for use in a digital data switching arrangement of the type in which a
central switch under the direction of a master control provides the cross-connection between a number of
high-bandwidth ports, to which are connected, on the ingress side of the central switch, ingress multiplexers,
one for each high-bandwidth input port, and, on the egress side of the central switch, one egress demultiplexer
for each high-bandwidth output port, each ingress multiplexer including a set of N input queues serving N
low-bandwidth data sources and a set of M virtual output queues, one for each low-bandwidth output data
source, characterised in that the scheduling and arbitration arrangement includes three bandwidth allocation
tables: the first of which, the ingress port table (IPT), is associated with the input queues and has NxM
entries, each arranged to define bandwidth allocation for a particular virtual output queue; the second of
which, the egress port table (EPT), is associated with the virtual output queues and has M entries, each
arranged to define the bandwidth allocation of a high-bandwidth port of the central switch to a virtual
output queue; and the third of which, the central allocation table (CAT), is located in the master control and
has (M/N)^2 entries, each of which specifies the weight allocated to a possible connection through the central
switch.

According to a feature of the invention there is provided a scheduling and arbitration process in which the 
scheduling of the input queues is performed in accordance with an N way weighted round robin. 

According to a further feature of the invention there is provided an implementation of the N way weighted
round robin by an N·(2^W - 1) way unweighted round robin, where W is the number of bits defining a weight,
using a list constructed by interleaving N words of (2^W - 1) bits each, with w_n 1's in a word, where w_n is
the weight of the queue n.



Figure 1 shows a hierarchical switch. The central interconnect provides the cross-interconnect between a
number of high bandwidth ports. A set of multiplexers on the ingress side and demultiplexers on the 
egress side provide the aggregation function between the low and high bandwidth ports. The low bandwidth 
ports provide connections from the switch to the data sources on the ingress side and the data 
destinations on the egress side. In practice, a switch is required to support full-duplex ports, so an ingress 
multiplexer and its corresponding demultiplexer can be considered a single full-duplex device which will be 
termed here a Router. 



It should be noted that the central interconnect may itself be a hierarchical switch, i.e., the methods
described in this document can be applied to switches with an arbitrary number of hierarchical levels. 

The aim of these methods is to provide a mechanism whereby the data stream from the switch to a particular 
destination, which consists of a sequence of cells interleaved from various data sources, can be controlled
such that predetermined proportions of its bandwidth are guaranteed to cells from each data source. 



In addition to the data flow indicated by the arrows in figure 1 above, there is also a flow of backpressure or
flow-control information associated with each of the data flows. This control-flow is indicated in figure 2
by dashed arrows.



Bandwidth Allocation in the Ingress Multiplexer 

An ingress multiplexer receives a set of data streams from the data sources via a set of low bandwidth input
ports. Each data stream is a sequence of equal size cells (equal number of bits of data). Figure 2 shows the
architecture of the ingress multiplexer. 

A set of N low bandwidth ports each fill one of N input queues. An Ingress Control Unit extracts
the destination addresses from the cells in the input queues and transfers the cells into a set of M virtual
output queues. There is one virtual output queue for each low-bandwidth output port in the switch.
The ingress multiplexer contains an NxM entry Ingress Port Table (IPT) which defines how its
bandwidth to a particular egress port (via a particular virtual output queue) is distributed across the input
ports. This table is used by the Ingress Control Unit to determine when (and to what degree) to exert
backpressure on the data sources, resolved down to an individual virtual output queue.

The ingress multiplexer sends control information to the central interconnect indicating the state of the
virtual output queues (connection requests). The central interconnect responds with a sequence of
connections which it will establish between the routers (connection grants). The ingress multiplexer must
now allocate the bandwidth to each egress demultiplexer provided by the central interconnect across the
virtual output queues associated with each egress demultiplexer. The ingress multiplexer contains an
Interconnect Link Control Unit (ILCU) which implements this function by scheduling cells from the
virtual output queues across the high bandwidth link to the central interconnect according to an M entry
Egress Port Table (EPT).

Weighted Round Robin Implementation 

The deterministic scheduling function of the ILCU can be defined as a weighted round robin arbiter (WRR).
The ILCU receives a connection grant to a particular egress demultiplexer from the central interconnect and
must select one of the N virtual output queues associated with that egress demultiplexer. This can be
implemented by expanding the N way WRR into an N·(2^W - 1) way unweighted round robin, where
W is the number of bits used to define a weight, such that if a queue has a weight of w, then it is
represented as w entries in the unweighted round robin list (figure 3). The unweighted round robin list is
constructed by interleaving N words (e_n) of (2^W - 1) bits each, with w_n 1's in a word, where w_n is the
weight of the queue n, as shown in Figure 3.

e.g., with 4 bit weights, a 4 way weighted round robin expands to a 60 way unweighted round robin. 

In order to optimise the service intervals to the queues under all weighting conditions, the entries in the 
unweighted round robin list are distributed such that for each weight the entries are an equal number of 
steps apart (±1 step). 

Table 1 shows an example of such an arrangement for 3 bit weights:

Table 1

w_n    e_n
1      1000000
2      1000100
3      1001010
4      1010101
5      1011011
6      1110111
7      1111111
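
As a minimal sketch of the interleaving described above, the following Python builds the unweighted round robin list from the Table 1 words for 3 bit weights. The function names, the slot ordering (bit j of the word for queue n placed at slot j·N + n) and the all-zero word for a weight of 0 are assumptions made for illustration; they are not taken from the Terachannel design.

```python
# Table 1 words for 3 bit weights, indexed by weight.
# The all-zero word for weight 0 is an assumption; Table 1 starts at weight 1.
TABLE_1 = ["0000000", "1000000", "1000100", "1001010",
           "1010101", "1011011", "1110111", "1111111"]


def expand_wrr(weights, words=TABLE_1):
    """Interleave one word per queue into a flat unweighted round robin list.

    Slot j * N + n holds bit j of the word for queue n (assumed ordering).
    A True slot means the queue owning that slot may be served there.
    """
    n_queues = len(weights)
    word_len = len(words[0])
    return [words[weights[n]][j] == "1"
            for j in range(word_len)
            for n in range(n_queues)]


def service_order(weights, words=TABLE_1):
    """Queue indices in the order in which one pass over the list serves them."""
    n_queues = len(weights)
    return [k % n_queues
            for k, used in enumerate(expand_wrr(weights, words)) if used]


def cyclic_gaps(word):
    """Distances between successive 1's in a word, wrapping round at the end."""
    ones = [i for i, bit in enumerate(word) if bit == "1"]
    return [(ones[(k + 1) % len(ones)] - ones[k]) % len(word) or len(word)
            for k in range(len(ones))]
```

With weights (1, 2, 4, 7) the list has 4 x 7 = 28 slots, of which 14 are used, and each queue is served as many times per pass as its weight. The helper cyclic_gaps can be used to confirm that the 1's in each Table 1 word are an equal number of steps apart to within one step.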



In the Terachannel the arbiter must select one of 9 queues with 4 bit weights: 8 VOQs as described above
and a multicast queue. This expands to a 135 entry unweighted round robin. The implementation of a large
unweighted round robin arbiter can be achieved without resorting to a slow iterative shift-and-test method
by "divide and conquer": the 135 entry round robin is segmented into 9 off 16 entry round robins (as
shown in figure 4), each of which can be implemented efficiently with combinatorial logic. (9 x 16 provides
up to 144 entries, so the multicast queue (up to 24 entries) can actually be allocated more bandwidth than an
individual unicast queue (up to 15 entries).)
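
The entry counts quoted above follow from simple arithmetic, reproduced here as a short Python check. The variable names are illustrative only.

```python
W = 4                        # bits per weight
UNICAST_QUEUES = 8           # VOQs served by the arbiter
BLOCKS, BLOCK_BITS = 9, 16   # partitioning of the round robin list

max_entries_per_queue = 2 ** W - 1                                  # 15
nominal_entries = (UNICAST_QUEUES + 1) * max_entries_per_queue      # 135
capacity = BLOCKS * BLOCK_BITS                                      # 144
# Entries left for the multicast queue if every unicast queue has 15.
multicast_max = capacity - UNICAST_QUEUES * max_entries_per_queue   # 24
```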



Figure 4 illustrates the partitioning of the round robin arbiter. 

The sorter separates the request vector V (144 bits) into 9 off 16 bit vectors (v0-8). It also creates the 9
off pointers (p0-8) for each of the 16 bit round robin blocks. The block which corresponds to the existing
pointer (which has been saved in a register) gets a '1' at the corresponding bit position, while the other
blocks get a dummy pointer initialised to location zero.

Each 16 bit round robin block now finds the next '1' in its input vector and outputs its location (g), whether it
had to wrap round (w) and whether it found a '1' in its vector (f). A selector can now identify the block
which has found the '1' corresponding to the next '1' in the original 135 bit vector, given a signal (s) from
the sorter which specifies which round robin block had the original pointer position. The selector itself is a
round robin function which can be implemented as combinatorial logic:

Find the next block starting at s which has w = false and f = true (if not found, select s).
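
The partitioned search can be modelled in software as follows. This is a sketch under stated assumptions rather than the hardware design: the pointer is taken to mark the position at which the search starts (inclusive), the per-block search is written as a loop instead of combinatorial logic, and all function names are hypothetical. A flat search over the whole vector is included as a reference for comparison.

```python
def block_search(bits, start):
    """Search one block from `start`, wrapping round; return (g, w, f).

    g is the location of the '1' found, w is True if the search wrapped
    round, and f is True if the block contains a '1' at all.
    """
    size = len(bits)
    for step in range(size):
        pos = (start + step) % size
        if bits[pos]:
            return pos, pos < start, True
    return 0, False, False


def partitioned_next(V, ptr, block_size):
    """Position of the next '1' in V at or after ptr, found block by block."""
    n_blocks = len(V) // block_size
    s, local_ptr = divmod(ptr, block_size)   # s: block holding the pointer
    results = []
    for b in range(n_blocks):
        v = V[b * block_size:(b + 1) * block_size]
        # Only block s gets the real pointer; the others start at zero.
        results.append(block_search(v, local_ptr if b == s else 0))
    # Selector: next block starting at s with w = false and f = true.
    for step in range(n_blocks):
        b = (s + step) % n_blocks
        g, w, f = results[b]
        if f and not w:
            return b * block_size + g
    # Not found: select s, which holds any '1' located before the pointer.
    g, w, f = results[s]
    return s * block_size + g if f else None


def naive_next(V, ptr):
    """Reference: iterative search over the whole vector."""
    for step in range(len(V)):
        pos = (ptr + step) % len(V)
        if V[pos]:
            return pos
    return None
```

For any request vector and pointer position, partitioned_next and naive_next return the same position, which is the property the selector rule is intended to preserve.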



Figure 5 shows an example of the above process, but for a smaller configuration for clarity
(V = 12 bits, P = 4 bits, v0-2 = 4 bits, p0-2 = 2 bits, g0-2 = 2 bits).

Bandwidth Allocation in the Central Interconnect 

The central interconnect provides the cross-connect function in the switch. The bandwidth allocation in the 
central interconnect is defined by an (M/N)^2 entry Central Allocation Table (CAT) which specifies the
weights allocated to each possible connection through the central interconnect (the central interconnect has 
M/N high bandwidth ports). 
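
The sizes of the three tables follow directly from N and M. The sketch below tabulates them; the class and function names are hypothetical, and it assumes that M is a whole multiple of N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSizes:
    ipt_entries: int   # N x M, held in each ingress multiplexer
    ept_entries: int   # M, held in each ingress multiplexer
    cat_entries: int   # (M/N)^2, held in the central interconnect


def table_sizes(n, m):
    """Entry counts for the IPT, EPT and CAT given N and M."""
    if n <= 0 or m % n != 0:
        raise ValueError("M is assumed to be a whole multiple of N")
    p = m // n   # number of high bandwidth ports on the central interconnect
    return TableSizes(ipt_entries=n * m, ept_entries=m, cat_entries=p * p)
```

For N=2 and M=4, table_sizes(2, 4) gives an 8 entry IPT, a 4 entry EPT and a 4 entry CAT.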

A technique for bandwidth allocation in the central interconnect is described in another patent application,
Probabilistic Masking for Bandwidth Allocation (not yet filed).



Bandwidth Allocation Tables (IPT, EPT, CAT) Programming 

- what goes into the tables? 

The Central Allocation Table (CAT) contains P^2 entries, where P=(M/N). Each entry w_ie defines the weight
allocated to the connection from high bandwidth port i to high bandwidth port e. However, not all
combinations of entries constitute a self consistent set, i.e., the allocations as seen from the outputs could
contradict the allocations as seen from the inputs. A set of allocations is only self consistent if the sums of
weights at each output and input are equal. Figure 6 shows a self consistent and a non self consistent set of
allocations for a 4 port interconnect with 3 bit weights.
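
The self consistency condition can be checked mechanically. The sketch below reads the condition as requiring every input sum and every output sum to share one common value, which is an interpretation of the text. The two example tables are hypothetical and are not the ones shown in Figure 6.

```python
def is_self_consistent(cat):
    """cat[i][e] is the weight from high bandwidth port i to port e.

    Returns True when all input (row) sums and all output (column) sums
    are equal to one common value.
    """
    input_sums = [sum(row) for row in cat]
    output_sums = [sum(column) for column in zip(*cat)]
    return len(set(input_sums + output_sums)) == 1


# Hypothetical 4 port tables with 3 bit weights (0 to 7).
CONSISTENT = [[1, 2, 3, 1],
              [2, 1, 1, 3],
              [3, 1, 2, 1],
              [1, 3, 1, 2]]

# Inputs 0 and 1 both claim the whole of output 0, and output 1 gets nothing.
INCONSISTENT = [[7, 0, 0, 0],
                [7, 0, 0, 0],
                [0, 0, 7, 0],
                [0, 0, 0, 7]]
```

In the second table every input sums to 7, but output 0 sums to 14 and output 1 to 0, so the view from the outputs contradicts the view from the inputs.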



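Assuming that the CAT has a self consistent set of entries, it is possible to define the bandwidth allocation
to a link between input port i and output port e with weight w_ie as p_ie:

\[ p_{ie} = \frac{w_{ie}}{\sum_{n=0}^{P-1} w_{in}} \]

Because the set is self consistent, the same total is obtained by summing the weights into output port e.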



The Egress Port Table (EPT) defines how the bandwidth of a high bandwidth port to the central
interconnect is allocated across the virtual output queues. There is no issue with self consistency (all
possible entries are self consistent), so the bandwidth allocation for a virtual output queue v with weight w_v
is given by:

\[ p_{v} = \frac{w_{v}}{\sum_{n=0}^{N-1} w_{n}} \]

where the sum runs over the N virtual output queues associated with the same egress demultiplexer.



Similarly, the Ingress Port Table (IPT) entries allocate the bandwidth of a virtual output queue to the
ingress ports, with the allocation for ingress port f with weight w_f given by:

\[ p_{f} = \frac{w_{f}}{\sum_{n=0}^{N-1} w_{n}} \]



Therefore the proportion of bandwidth at an egress port v allocated to an ingress port f is given by:

\[ p_{fv} = p_{f} \, p_{v} \, p_{ie} \]
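
A worked example of the three normalisations and their product is sketched below. It assumes that each proportion is a weight divided by the sum of the weights it competes with, so each argument list holds exactly the weights that appear in one denominator. The function names and the weights used in the usage note are hypothetical.

```python
from fractions import Fraction


def share(weights, index):
    """Proportion of bandwidth given to weights[index] among `weights`."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return Fraction(weights[index], total)


def end_to_end_share(ipt_weights, f, ept_weights, v, cat_weights, e):
    """Product of the IPT, EPT and CAT proportions for one ingress port.

    ipt_weights: weights of the ingress ports competing for one VOQ
    ept_weights: weights of the VOQs competing for one connection
    cat_weights: weights from one high bandwidth input port to each output
    """
    p_f = share(ipt_weights, f)
    p_v = share(ept_weights, v)
    p_ie = share(cat_weights, e)
    return p_f * p_v * p_ie
```

For example, with IPT weights (3, 1), EPT weights (2, 6) and CAT weights (5, 5), the call end_to_end_share([3, 1], 0, [2, 6], 1, [5, 5], 1) evaluates 3/4 x 3/4 x 1/2 = 9/32.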




Managing Bandwidth Allocation Tables 

- where do the weights come from? 

In a switch which is required to maintain strict bandwidth allocation between ports (such as an ATM 
switch), the tables are set up via a Switch Management Interface (SMI) from a Connection Admission and
Control (CAC) processor. When the CAC has checked that it has the resources available in the switch to 
satisfy the connection request, it can modify the IPT, EPT and CAT to reflect the new distribution of traffic 
through the switch. 

In contrast, a switch may be required to provide a "best effort" service. In this case, the table entries are
derived from a number of local parameters. Two such parameters are l_v (length of virtual output queue v)
and u_v (urgency of virtual output queue v). Queue urgency is a parameter which is derived from the headers
of the cells entering the queue from the ingress ports.

A switch can be implemented which can satisfy a range of requirements (including the two above) by
defining a Weighting Function which "mixes" a number of scheduling parameters to generate the table
entries in real time according to a set of sensitivities to length, urgency and pseudo-static bandwidth
allocation (s_l, s_u, s_s). The requirement on the function is that it should be fast and efficient, since multiple
instances occur in the critical path of a switch. In the Terachannel the weighting function has the form:




[Equation for w_v not recoverable from the source; its surviving terms are l_v, p_v and u_v.]



where b_v is the backpressure applied from the egress Router,

w_v is the weight of the queue as applied to the scheduler,
p_v is a pseudo static bandwidth allocation (e.g., EPT entry).



Backpressure in the Terachannel is described in another patent application, Flow Control Architecture for
a Digital Traffic Switch (not yet submitted).

Despite the apparent complexity of this function, it can be implemented exclusively with an adder, 
multiplexers and small lookup tables, thus meeting the requirement for speed and efficiency. 

Features of this weighting function: 




• s_l=1.0, s_s=0.0, s_u=0.0 : Bandwidth is allocated locally purely on the basis of queue length, with a
non-linear transfer function, so that the switch always attempts to avoid queues overflowing.

• s_l=0.0, s_s=1.0, s_u=0.0 : Bandwidth is allocated purely on the basis of pseudo-static allocation as
described above.

• s_l=0.0, s_s=1.0, s_u=0.5 : Bandwidth is allocated on the basis of pseudo-static allocation, but a data source
is allowed to "push" some data flows harder (when the demand arises) by setting the urgency bits in the
appropriate cell headers.




Example

Figure 7 is a block diagram of a small switch based on the above principles, showing the correct numbers of
queues, tables and table entries.



e.g., for a hierarchical switch with 2 routers, each with 2 low-bandwidth ports (N=2, M=4).



Assuming that each low-bandwidth port can transport 1 Gbps of traffic, each high-bandwidth link can carry
2 Gbps and the switch is required to guarantee the following bandwidth allocations:
[Table of guaranteed bandwidth allocations; surviving values: 0.2, 0.5, 0.2, 0.1, 0.1, 0.6, 0.2]

The IPT, EPT and CAT tables would be set up by the CAC processor with the following 4 bit values (note 
that there will be rounding errors due to the limited resolution of the 4 bit weights): 
(in Router CD):
(in Router CD):
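
The rounding error introduced by 4 bit weights can be illustrated with a small sketch. The scaling rule used here (the largest fraction is mapped to the full-scale weight of 15 and the others are scaled in proportion) is an assumption and not necessarily the rule applied by the CAC processor; the function names and the fractions in the usage note are illustrative.

```python
def quantise(fractions, bits=4):
    """Map bandwidth fractions to integer weights of the given width.

    Assumed rule: the largest fraction receives the full-scale weight and
    the rest are scaled in proportion and rounded to the nearest integer.
    """
    full_scale = 2 ** bits - 1
    largest = max(fractions)
    return [round(f / largest * full_scale) for f in fractions]


def realised(weights):
    """Bandwidth fractions actually delivered by a set of weights."""
    total = sum(weights)
    return [w / total for w in weights]


def worst_error(fractions, bits=4):
    """Largest absolute difference between requested and realised fractions."""
    delivered = realised(quantise(fractions, bits))
    return max(abs(r - f) for r, f in zip(delivered, fractions))
```

Fractions of (0.2, 0.5, 0.2, 0.1) map to the weights (6, 15, 6, 3) and are reproduced without error, whereas (0.3, 0.3, 0.4) map to (11, 11, 15), so the delivered fractions differ slightly from the requested ones.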